Internet-Draft <draft-chiappa-routing-01.txt> Expires on April 1994
A New IP Routing and Addressing Architecture
J. Noel Chiappa
[ Author's note: This document is not yet fully complete, but it is being
distributed in this rough form since there appears to be a considerable
appetite in the network community for discussion on this topic. It is hoped
that this note, even in its present form, will facilitate useful debate about
the issue. ]
Status of the Memo
This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its Areas,
and its Working Groups. Note that other groups may also distribute
working documents as Internet Drafts.
Internet Drafts are draft documents valid for a maximum of six
months. Internet Drafts may be updated, replaced, or obsoleted
by other documents at any time. It is not appropriate to use
Internet Drafts as reference material or to cite them other than
as a "working draft" or "work in progress."
Please check the I-D abstract listing contained in each Internet
Draft directory to learn the current status of this or any
other Internet Draft.
1 Introduction
This document presents one possible new IP routing and addressing
architecture. By 'routing and addressing' is meant that part of the overall
IP architecture which is responsible for identifying computing nodes,
locating them in the Internet, and getting traffic from one to another. It
represents one person's view of a good overall system answer to this question,
and is not to be taken as anything more than that.
While this document will be of most interest to those with some degree
of familiarity with routing concepts, it is aimed at the general network
community, which will have to agree with the general direction taken in
evolving this aspect of the IP architecture. While the subject is a complex
one, the document has been written to be accessible to a general audience. As
such, it contains much material which will be familiar to those who are versed
in routing work, which has unavoidably lengthened the document.
The goals of the new architecture are more or less those outlined in
RFC1126 [Little], particularly Appendix A, but modified somewhat to include
what I feel are real constraints, such as more complex external policy
concerns. To review them briefly, they are to allow a system substantially
larger than the current one (if possible of essentially unlimited size), to
address the policy concerns which are of increasing importance, to allow
topologies of a much more complex nature, and to retain the flexibility of the
current system. A concise outline of the goals believed necessary is given in
the section below.
These design goals are a very aggressive set of requirements. They
exceed considerably the existing capabilities of other large communication
systems, such as the telephone system. Two aspects in particular represent a
major leap. The first such aspect is the attempt to create a ubiquitous
communication system (and route traffic through it) out of a very large number
of cooperating autonomous units (although in some senses the road network does
currently accomplish this goal). The second is the assumption that the system
will be extremely dynamic; i.e. respond in real time, without human
intervention, to changes in the topology and other configuration aspects of
the system.
To meet these goals requires a substantial modification to the basic
philosophy and mechanism of the existing IP packet layer architecture. Since
the system is now widely deployed, a change of this sort - an in-place
modification to a large, operating, system - presents a substantial challenge.
To meet it, it is first necessary to solve the problem of routing, and
only then turn to addressing. As will be seen, routing (and especially the
abstraction of data which will be necessary to handle a very large system) is
a sufficiently difficult problem that any way of making the problem easier is
desperately needed. Addressing needs to be completely sublimated to the needs
of the routing architecture; the form of addresses should be whatever is most
useful to the routing. Dividing the address up in such a way as to ease
administrative tasks is not correct, although a separate mechanism to handle
such administrative concerns is possible.
It should be noted that this document is not an engineering design
document. No detailed mechanisms have been laid out, and undoubtedly many
pitfalls remain along the path to a complete system. Little attempt has been
made to consider optimizations for common cases and other engineering
practices; no study has been made of traffic patterns, and of course the cost
of the actual mechanisms is unknown since the design has not been done; both
are necessary to guide the cost/benefit choices inherent in good engineering.
Rather, this document contains a study which attempts to clarify what are
the fundamentals of the problem, and then identifies what seems to be the
best path available to solve that problem.
Also, it was felt that rather than proceed forward along a visible
path from where we are now, the best method (when considering the long term
health of the network) was to act as if a blank piece of paper were
available. When the best possible design was created, it would at that point
be appropriate to see whether it was feasible to reach that design from the
current system. Proceeding from the optimal end point backward seems more
likely to result in the best long term results than proceeding forward from
the current state of affairs in an incremental fashion.
Thus, this document is an attempt to decide on a general plan of
attack on the problem; a broad-brush architecture based on a long study of the
fundamental issues and problems in routing and addressing in a very large
network.
2 Design Goals
The goals for the new architecture are of three types. The
first are those capabilities that the existing architecture already has, which
are still needed. The second are things that it cannot do, and which are
proving to be major hindrances in actual operation. The third are those
necessities which will be required in the future, and which may not be obvious
now.
There are also certain longstanding goals of the IP architecture,
such as the ability to support mobile hosts, which have previously been
difficult. It proves possible to support these in the new architecture, as
will be seen.
2.1 Current Capabilities
First and foremost, the new architecture must be incrementally
deployable with respect to the existing system. The capital investment in
existing software and plant, and the day-to-day operational dependence of many
people, is so large that to start from scratch would be phenomenally
expensive, and utterly impractical. If there were no way to do it without a
fresh start, perhaps it would be practical, but since it seems to not be
necessary, incremental deployability is a must. This does not preclude eventual
replacement of the existing architecture, obviously; to retain it indefinitely
would unduly cramp the system.
In specific terms, a new system should not require any changes to
hosts to become fully operational (although eventual changes to hosts are
allowable to take full power from the new scheme); too many unmodifiable
existing hosts exist, and support for them must be maintained for some time in
the architecture.
Second, changes to routers (which are almost certainly unavoidable if
anything concrete is to be attained) need to be interoperable (for a period,
at least) with the existing system, although it is reasonable in the case of
routers to insist that basically all be converted before the new architecture
can be fully operational. Some feel that even changing all the routers is too
ambitious, and that the architecture should avoid requiring it, but that
restriction appears unneeded, and conflicts with the general principle of
providing the best possible system for the long term.
Next, the current system allows minimal firewalls between AS's; errors
in routing data from outside the AS cannot (in a properly designed AS)
interfere with communication inside the AS. While the concept of an AS is
perhaps outdated, this property is desirable, although it needs to be
strengthened and made more flexible. As things stand, a single misbehaving AS
can cause almost limitless problems in the inter-AS routing. A systematic way
of minimizing such problems needs to be included.
Finally, it must be vendor-independent; that is, specified in enough
detail that an implementation by a new vendor, operating solely from the
specifications, will interoperate correctly with other existing
implementations.
2.2 Current Limitations
The things about the current system that are most limiting are the
limit on the total size of the system, the lack of support for policy routing,
and the inability to support arbitrary topologies (although this last
restriction is being lifted by the deployment of BGP).
The size restriction has two phases. First, a key existing routing
protocol (EGP) has a small limit on the total number of networks, but, as
with the problems of topology, as BGP is deployed as a replacement, this
limit will fade.
The second set (which consists of three far more fundamental problems)
is that the system is running out of network numbers, out of the ability to
route them, and eventually, out of address space completely. This topic is not
treated here in detail [Chiappa91], but briefly, the first problem involves
the accelerating use of the limited number of network numbers, the second
concerns the difficulty of routing in a flat address space which is growing
(very quickly, to boot), and the third recognizes the eventual limits of the
32 bit address.
Policy routing is the phrase used to describe external, administrative
input to the data routing process. Currently, the system provides little to no
support for this. The ad hoc mechanisms being used are so complex as to
confuse all but the most skilled, and contain substantial potential for
errors. The main problem is that the existing mechanisms are fairly unrelated
to the desired goals, and the use of these mechanisms to produce the desired
goals is thus complex and often impossible; they simply do not address
existing policy requirements.
Policy requirements can be divided into three areas, which can be
called a) 'access control', control over which external users are allowed to
use a given resource; b) the 'trust model', control over which external
resources a given user is willing to use; and c) 'information hiding',
control over which external users are allowed knowledge of a resource.
While these divisions are not necessarily the theoretical minimum, they do
appear to accurately follow the needs of the majority of users; providing
these will make configuration of the system to provide the desired policies
far simpler by virtue of the close match between what is desired and the tools
available to do it.
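For concreteness, the three divisions might be modelled as follows (a
rough sketch in Python; every name in it is invented for exposition, and is
not part of any protocol):

   class Resource:
       def __init__(self, may_use, may_know):
           self.may_use = may_use    # (a) access control: which external
                                     #     users may use this resource
           self.may_know = may_know  # (c) information hiding: which
                                     #     external users may know of it

   class User:
       def __init__(self, trusts):
           self.trusts = trusts      # (b) trust model: which external
                                     #     resources this user will use

   def route_may_include(user, resource):
       # A resource may appear in a route for a user only when both
       # the resource's policy and the user's policy allow it.
       return (user in resource.may_use and
               resource in user.trusts)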
BGP will do little to radically increase capabilities in this area.
About the only additional capability provided by BGP will be that since
complete path information to the destination is available, a limited amount of
additional 'trust model' control will be available, as will some 'access
control'. However, as the actual controls are still basically those currently
available, the ills of the existing system remain.
One thing to note is that while enough policy requirements have been
verbalized to build a general model, no complete list of what will be required
is by any means possible at this point. This means that whatever policy system
is created will have to be able to add new policies as time goes on.
This latter requirement will be discussed in more detail later. Note
also that the second policy class (trust model) implies control by the source
of the route through the network. This also has substantial implications.
The current restriction on interconnection, which prohibits cycles,
is untenable on reliability grounds (since alternate paths are eliminated) as
well as difficult to enforce. As noted, this restriction is based on the
technical limitations of the EGP protocol, and as BGP is deployed as a
replacement, this limit will fade.
Finally, another problem noted with the current system is the use of
toll networks. This includes not only the cost of running the routing
protocols over such networks, but also distributing the information that
certain paths cost money, and allowing traffic the choice of whether or not to
use such networks. In some sense this latter requirement is a variant of
policy routing, but in another it is yet another requirement, often labelled
'type of service routing', by which it is meant that different types of
traffic needs links of different characteristics. For instance, remote login
traffic needs low delay links, while bulk data transfer needs high bandwidth.
In the cases of both type of service, and access control, it is also
necessary to allow for future expansion of the types of each supported, since
it is unlikely that the complete set can be specified at this time.
2.3 Future Requirements
In the category of requirements that are not yet restrictive, but
of which we are becoming aware, one very important one is to provide a
system that is more immune to problems, of both accidental and deliberate
cause. The fact that this system will connect effectively everything means
that great reliance will be placed on it, and equally great confusion will
be caused should it fail. Engineering failures need to be prevented by
thorough simulation as well as use of large scale testbeds. Deliberate
attack needs to be prevented by the inclusion of provisions for security;
these must provide for the use of one of an (extendable) variety of optional
security mechanisms, to allow the cost and level of security to be
optimized depending on the needs and cost of failure.
Another important need is to minimize the amount of configuration
needed, to minimize the possibility of having conflicting configuration
information, and to prevent such conflicting information from causing a
problem. The existing routing architecture suffers from being overly complex
to configure. While this may not be obvious to technically sophisticated
people with long association with the network world, many new users find the
existing architecture confusing and difficult to set up. While they perhaps
expect too much, since you cannot set up a complex system with little or no
study and engineering, improvements can be made. In addition, conflicting
configuration information currently can cause severe problems; this is another
reason why such information needs to be minimized. Finally, the architecture
must be robust in response to the inevitable errors in configuration.
3 Routing and Addressing Fundamentals
After some consideration of the problem, it became obvious that the
most difficult of the requirements is the size of the system. While some
approaches make light of this problem, they do so at the cost of restrictions
which make other stated requirements difficult to meet. The new routing
architecture had not only to provide capabilities unheard of in the older
system, but also do so in a very large system. Providing the new capabilities
proved to be not so difficult, but the size problem was less tractable. It is
not practical to always provide complete information about every detail of
the connectivity and routing of the entire system to everyone; some way of
reducing the data was necessary.
At this stage it is useful to introduce some technical terms.
'Compression' is taken to mean reducing one set of routing information to
another which, while more compact, is effectively the same; i.e. would
produce the same routing decisions in all cases. 'Thinning' is taken to mean
discarding some of the data to reduce it to a manageable size. In this case,
routing decisions taken using the thinned data are possibly non-optimal,
since some necessary information is unavailable. 'Abstraction' is a term
including either of these two possible ways of reducing the volume of data.
'Attenuation' means that information becomes less complete the further you
are from the source of the information.
However, before we examine this part of the problem in detail, it is
necessary to review some fundamentals of routing and addressing.
3.1 Routing Algorithms
The family of routing algorithms with the most promise is the
'link-state' (LS) group. They are so named because the information that is
distributed among the nodes running the routing protocol is about the
existence and state of the connecting links in the system (i.e. a topology map
of the system) rather than distances to destinations, as in the
'distance-vector' (DV) algorithms.
3.1.1 Distance-Vector Algorithms
In a DV algorithm, each node sends to all of its neighbours a
complete table of its distance (measured in some agreed upon metric(s)) to
all the destinations in the system (hence the name). For destinations it can
reach directly, it advertises some low metric; for destinations it reaches
through some neighbour, it lists what that neighbour told it plus some
increment. Each recipient node then compares the advertised distance to the
one it currently has for that destination; if what it just heard was better,
it records it and routes traffic to that destination to the neighbour node
that sent it the update.
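To make the update step concrete, the following is a minimal sketch
(in Python; the table layout, the single additive metric, and the value of
'infinity' are all illustrative assumptions, not part of any particular
protocol) of how a node might merge a neighbour's advertisement:

   INFINITY = 16   # conventional 'unreachable' metric (assumed value)

   def process_update(table, neighbour, link_metric, advertised):
       # 'table' maps destination -> (metric, next_hop); 'advertised'
       # maps destination -> metric as seen by the neighbour, and
       # 'link_metric' is the increment for the link to that neighbour.
       changed = False
       for dest, metric in advertised.items():
           new_metric = min(metric + link_metric, INFINITY)
           old_metric, old_hop = table.get(dest, (INFINITY, None))
           # Accept a strictly better route, or any report from the
           # neighbour we already route through (its view may worsen).
           if new_metric < old_metric or old_hop == neighbour:
               if (new_metric, neighbour) != (old_metric, old_hop):
                   table[dest] = (new_metric, neighbour)
                   changed = True
       return changed   # if True, the node re-advertises its own table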
There are many variants, designed to add capabilities or fix problems
with the algorithm. The most common problems are that DV algorithms tend to
form temporary routing loops when the topology changes, before things
stabilize, and that they are unable to detect that a destination has become
unreachable until the distance to it passes some arbitrary 'infinity'.
3.1.2 Link-State Algorithms
In the simplest form, LS algorithms distribute a complete topology
map to each switching node in the system. This is done quite simply; each
node generates a message describing the state of the links attached to it,
and this message is then quickly flooded through all the nodes in the system
using a reliable flooding protocol. A topology map can easily be constructed
from the messages from all the nodes in the system.
These nodes may then run in parallel an algorithm which generates
from this topology map a routing table which can be used to forward packets
to various destinations. The MILNet is currently using an LS algorithm in
which the algorithm that computes the routing table is one due to Dijkstra,
called 'Shortest Path First'. [Dijkstra]
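For illustration, a minimal form of that computation might be sketched
as follows (in Python; the topology representation is an assumption made
purely for exposition):

   import heapq, itertools

   def shortest_path_first(topology, source):
       # 'topology' maps node -> list of (neighbour, cost) pairs, as
       # reconstructed from the flooded link-state messages.  Returns
       # destination -> (distance, first_hop), i.e. a routing table.
       table = {}
       tie = itertools.count()   # keeps the heap from comparing nodes
       frontier = [(0, next(tie), source, None)]
       while frontier:
           dist, _, node, first_hop = heapq.heappop(frontier)
           if node in table:
               continue          # already settled by a shorter path
           table[node] = (dist, first_hop)
           for neighbour, cost in topology.get(node, []):
               if neighbour not in table:
                   hop = neighbour if first_hop is None else first_hop
                   heapq.heappush(frontier,
                                  (dist + cost, next(tie), neighbour, hop))
       return table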
However, it should be noticed that the second step, creation of a
routing table from the topology map, is purely an engineering decision. It is
not necessary to precompute the entire routing table in an LS algorithm,
whereas it is in a DV algorithm, since this precomputed routing table is the
data that is passed on to the neighbour nodes. In an LS algorithm, it is
perfectly plausible to compute routes on demand, from the topology map, and
store them in a cache. This reduces the computational burden substantially,
and, as we shall see, is a key element in the solution.
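A rough sketch of this on-demand style (again in Python, with invented
names) might be:

   class RouteCache:
       # Routes are computed from the topology map only when traffic
       # demands them, and cached, rather than precomputed in full.
       def __init__(self, topology, compute_route):
           self.topology = topology            # the link-state map
           self.compute_route = compute_route  # e.g. an SPF variant
           self.cache = {}

       def route_to(self, destination):
           if destination not in self.cache:
               self.cache[destination] = self.compute_route(
                   self.topology, destination)
           return self.cache[destination]

       def topology_changed(self, new_topology):
           # Cached routes are only as good as the map they came from.
           self.topology = new_topology
           self.cache.clear()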
Also, in the classic LS routing scheme, the one essential is that all
the nodes run the same route generating algorithm from a consistent map
database, otherwise it is possible to form routing loops when two nodes
disagree about the best 'next hop'. However, if traffic is not routed using
the 'hop-by-hop' method, both of these requirements can be relaxed; this is
critical, as will be seen.
3.2 Routing Algorithm Choice
Several advantages were noted about the LS class of algorithm that
recommended it to the ARPANet implementors [McQuillan]. Most of these are
traceable to the fact that since the basic DV algorithm is a distributed
algorithm for calculating routes, it is in many senses less robust and less
characterizable (in a response-time sense, not in ultimate stability after a
single change) than the LS algorithm, which runs in parallel locally on all
machines.
First, it stabilized quicker after a topology change. This was
important; since the ARPANet used load-based routing, the metrics on links
changed fairly frequently. The speed was due to the fact that when a link
changed state, that fact became known almost instantly everywhere (due to the
fast flooding), after which the nodes all quickly recomputed their routing
tables in parallel, at which point the routing had stabilized in the new
pattern. This is in contrast with DV algorithms, where the effects of a change
in the topology had to pass through computations in each node on its way
across the network; additionally, the process might have to be repeated
several times before the new stable pattern was achieved.
Second, it detected destinations which had become unreachable more
quickly; this enabled the switches (which offered a reliable transmission
service) to discard packets for the dead destination more quickly, avoiding
buffer congestion from undeliverable and undiscardable traffic in the network.
Third, it was less likely to form loops (formation of which had been a severe
problem under the old DV routing algorithm, due to certain ill-advised
modifications in the algorithm which had promised other gains), and quickly
broke those which did form.
These advantages were not particularly useful to this application.
Since the system was so large, it was felt that practical load-based routing
was not possible. In such a large system there would be frequent topology
changes in any case; there were concerns that, with the added changes of load
based routing, the routing would be less stable. The stability is
certainly useful, but not as crucial when the metrics are essentially static.
Likewise, the other two advantages, while notable, can be met with
modifications to the basic DV algorithms.
The DV algorithms do appear to have one advantage, which is that they
are apparently more suitable to routing in large nets. Since data is
processed through the routing table before being passed on, they very easily
can be modified to provide attenuation as a way of abstracting data.
(The algorithm is quite simple; as soon as the routes to all the subelements
of a routing element have the same next hop, it is no longer necessary to
send out routing tables listing each subelement individually; a single
entry for the element will suffice.)
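Sketched under the same illustrative assumptions as the earlier
fragments, that aggregation step might read:

   def attenuate(table, element, subelements):
       # 'table' maps destination -> (metric, next_hop).  When every
       # subelement of 'element' is reached via the same next hop, the
       # advertised table carries one entry for the whole element.
       hops = {table[s][1] for s in subelements if s in table}
       if len(hops) != 1:
           return table                # still need individual entries
       advertised = {d: v for d, v in table.items()
                     if d not in subelements}
       worst = max(table[s][0] for s in subelements if s in table)
       advertised[element] = (worst, hops.pop())
       return advertised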
However, this advantage is overwhelmed by a disadvantage, which is
that they cannot easily be enriched to handle the problems of both policy
based and type of service routing (which are technically very similar) without
exponential explosion in the amount of data which must be computed and
transferred. Since route calculation in DV algorithms depends on continuous
calculation of all possible routes in a distributed computation, handling
policy considerations means that routes to all destinations under all
conceivably desirable combinations (or allowed combinations, depending on the
system design) of policy requirement must be maintained. This is the only way
to have a route available for a given destination/policy combination when
traffic for that combination appears (or within any reasonable time of such
traffic appearing, as explained below).
(A possible fix for this problem involves only computing a routing
table entry for a given destination/policy combination when traffic for that
particular combination arrives. Before that traffic could be forwarded,
however, the distributed algorithm would have to run, probably to
stabilization (discovering when that happens would be a good trick in and of
itself), to calculate the next hop at the first hop. This almost certainly
presents an unacceptably long delay before the traffic can be forwarded.)
In any case, the distributed nature of the DV algorithm requires far
more line bandwidth and computational power than the LS algorithm when
policies are included; in a large network these alone will present substantial
difficulties. Use of a DV algorithm in a world with policy routing
requirements is simply infeasible.
A key component of handling the requirements is the realization that
LS algorithms can handle this problem quite easily. [Chiappa84] Since each
node has a complete map of the system, including all the links, it is quite
easy to indicate on each link what 'attributes', of either policy constraints
or service type, that link possesses. Since the LS algorithms can compute
routes on demand, it is only necessary to create a route for a particular
combination of requirements when it is needed. It is easy, and costs little,
to tag each link with its attributes before the link information is flooded
through the system.
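As a sketch of the idea (in Python, with the attribute representation
purely an assumption for exposition), the map can be reduced to the subgraph
acceptable under a given combination of requirements, after which any
ordinary route computation, such as the SPF fragment above, can be run on
demand:

   def policy_subgraph(topology, required):
       # 'topology' maps node -> list of (neighbour, cost, attributes)
       # triples, where 'attributes' is the set of policy/service tags
       # flooded with the link; 'required' is the set a route must have.
       return {node: [(n, cost) for (n, cost, attrs) in links
                      if required <= attrs]
               for node, links in topology.items()}

A route for a given destination/requirement combination is then computed
only when traffic for that combination actually appears.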
It is this characteristic of LS algorithms, that the actual route
calculation is separate from the distribution of the data, and can be delayed,
that is the key one. The fact that the calculation can be delayed (which makes
handling complex policy routing possible) is a fundamental difference between
LS and DV algorithms, and a key reason that DV algorithms are fundamentally
unsuited for use in the new IP routing architecture.
The problem with LS algorithms is that they do require a complete map
of the topology. As explained above, this is impractical, due to the size of
the target system. However, it is possible to modify LS algorithms to use
abstraction, so this is the approach chosen. This decision brings other
problems, though.
It turns out that compression alone is not an adequate means of
attack on the problem. There are too many topologies that simply cannot be
represented by a simpler but equivalent topology. Since the data must be
reduced to make the problem manageable, it is necessary to discard some, and
use thinning. The ramification of this step is that when routing data is
thinned, the lost information means that in many cases the system will fail
to detect possible routes with the required characteristics. This can lead to
the creation of non-optimal routes, or in the worst case, failure to find any
route at all, even though one does in fact exist.
The hardest problem thus becomes managing the discarding of
information, based on cost-benefit tradeoffs which the protocol cannot possibly
comprehend. Simply put, data must be discarded to make the cost of running the
protocol reasonable. However, taking this step has a cost elsewhere, in
non-optimal traffic routing; the problem is a classic cost-benefit tradeoff.
Thus, thinning involves value judgements, and is thus a matter of policy as
well as a technical problem.
The ramifications of these issues will be addressed later.
3.3 Addressing Fundamentals
Before moving on to review the proposed architecture, the final piece
of background that is necessary is to briefly review the fundamental network
concepts of addresses, routes, etc.
Although a number of papers have been written on the subject of
addresses and routes [Shoch] [Saltzer82], a considerable lack of understanding
of the basic aspects of these fundamental concepts still remains. Readers need
to be clear in their minds about the difference between fundamental concepts (and
the objects which are the concrete instantiations of these fundamentals), and
the different kinds of names (and the reasons for and uses of these names)
which we may give these objects.
3.3.1 Object Spaces and Names
There has been confusion as to exactly what the fundamental object
spaces of networking are; most previous architectures have tried to make do
with too few. Previous work has also pointed out that common practice in
networking has been to bind the forms of names for these objects too tightly
to the objects themselves.
Two key object spaces used in this architecture are i) the node
identifier, which indicates one particular computing node to which packets
may be delivered, and ii) the network address, which specifies an attachment
point to the network; a network interface to which traffic may be routed. (In
fact, the term "address" is usually taken to refer to a particular kind of
name for a network attachment point.) Multicast concepts are not considered
here, since multicast can be built as a layer on top of this, and is thus less
fundamental.
Each of these object types may have one or more kinds of names. The
names may or may not have structure. Names with structure are invariably
adopted to make some mechanism (often a mapping from that name to something
else) easier to implement. For instance, an attachment point may be identified
either by a unique, fixed length binary flat identifier (e.g. an Ethernet
interface number in a bridged network), or, as is more usual, a structured
hierarchical form (e.g. a typical IP address). In addition, there are mappings
between names and objects, and between different object classes.
As was mentioned, problems in thinking about the issues of routing and
addressing can occur when these object spaces are not clearly differentiated,
or when the form of the name for an object is not separated from the concept
of the object. An example of the former is the lack of distinction in the IP
architecture between attachment points and node identifiers; this has left us
with great difficulties in handling multi-homed hosts and mobile hosts. An
example of the latter is that a network address is usually taken to be a
structured hierarchical item, but this is in fact just one possible way of
naming attachment points; a very useful name, to be sure, but not the only
one.
Whenever one has different classes of objects, and names for them,
several questions immediately arise. The first is "do we have the right
classes", the second "what is the mapping between the classes of objects", the
third "how many names, and of what form, do we have for each class of object",
and the fourth "what is the mapping between the names and objects".
The mappings form the heart of the power, and the problems. Two
aphorisms of computer science illustrate this: the first (due to B. Lampson)
is "All problems in Computer Science can be solved by adding another level of
indirection"; the second (due to D. Clark) is "The function of an operating
system is to establish many different names for the same object, so that it
may busy itself keeping track of the mappings between the various names."
Do we have the right fundamental objects? Exactly which names and
mappings are useful in this instance?
3.3.2 Names and Mappings
It is clear that the two object classes named are highly useful. Are
there any remaining problems, the solution to which requires yet another
object class (and associated mappings)? Only one appears possible; the problem
of hierarchical addresses implying hierarchical routes. This topic cannot be
treated in full at this point in the document; the architecture as given
solves this problem, but perhaps not in an efficient enough way. It may be
necessary to add an extra object class to deal with it. For the moment,
however, the answer appears to be that the two mentioned are enough.
As to the mappings among the classes, this also appears simple enough.
Just as, in the real world, a single node may have several network interfaces,
so a single node identifier may map to more than one attachment point. Also, just
as hosts may move, the mapping from a node identifier to a network attachment
point may change.
The only namespace for nodes which appears to be useful is a
relatively short, fixed length, pseudo-flat binary identifier. This name is
intended solely for globally unique identification of nodes, and for
efficient processing by automatons; this dictates the form of the name. (Since
the space will probably be split up among various number assignment authorities
for ease of administration, it only appears to be flat.)
reasons to appear below) that the most useful form of the network address is
long and complex, having at least one name for the destination in the packet
which is easy to manipulate is desirable. There does not appear to be a use
for a different kind of node identifier at the moment, but it could be added
if need be.
This is obviously in addition to the existing domain names, which are
hierarchical character string names for hosts. These names are obviously
intended to be useful to humans, and the structure in them is intended to make
interpretation by a distributed database (the Domain Name System) easier. It
is entirely feasible that domain names which represent, for example, services
could map to more than one node identifier, but that is a different issue.
The names of network attachment points, the address, present a more
interesting problem. An address usually does two things. First, it uniquely
identifies a connection point to the network. Second, it does so in a
structured manner, which allows something useful to be done with it. Exactly
what that structure is depends on what other uses it is put to; most often the
structure of the address is used in routing.
Normally, groups of attachment points which are topologically related
are given related addresses. Usually, this allows the number of different
destinations which must be tracked in the various routing devices throughout
the network to be reduced (in its simplest form, this allows routing tables to
be smaller). That is, rather than keep knowledge of all the individual network
attachment points, a routing device can keep information on collections of
network attachment points, at a considerable savings.
It may have other benefits; in this architecture, it also does several
other very important things. First, the structured address allows the point on
the topology graph which the address names to be found easily, without mapping
through any additional object class, or other attachment point name. Second,
it helps provide a representation by which topology information can be
distributed. Third, the structure of the address space provides a framework
for the abstraction process by which simplified graphs of the topology are
created.
Finally, in the current version of this architecture, the address
(which is hierarchical) is involved in creating those routes which take the
least amount of effort to compute (sometimes called "no-brainer" routes),
those routes being basically hierarchical. (It is this latter function which
may prove inefficient to overload onto the address of a network attachment
point, necessitating the creation of a different object class or naming space
for network attachment points.)
Note that all these functions are usually performed by a single
kind of name, a structured address. However, other possibilities have been
suggested.
One proposal is that each network attachment point be given a unique
binary flat identifier; this would allow the creation of multiple,
independent, structured namespaces (i.e. address spaces), each of which has a
separate mapping into the first namespace which identifies all attachment
points. The claimed advantages are that it would be possible to experiment
with new addressing schemes, or introduce a new one, or even operate with
several different ones in use at the same time.
This possibility has been discarded in this architecture, however. The
whole point of an addressing system is to allow traffic to flow between all
the myriad parts of the Internet. To do that, it is necessary to pass around
routing information, and the form in which this is passed around is inevitably
that of addresses. A new address scheme, resulting in an address form which
the old routing architecture did not understand, would inevitably interrupt
this flow.
Moreover, should it be necessary to supersede the addressing scheme in
this architecture (if adopted), it would be just as easy to create a new
mapping from node identifiers to attachment points as it would to create a new
mapping from attachment point names to attachment points. Alternatively,
mappings could be created from old attachment point names to new ones
directly. No good reason can be seen for the flat namespace for attachment
points, so incurring the cost of it should clearly be avoided.
4 Routing and Addressing Architecture
It is necessary to consider some ramifications of some of the
requirements in detail, and the steps taken to meet them, before the entire
architecture can be outlined.
4.1 Requirement Ramifications
The two major ones have to do with the trust model requirement, and
the ability to expand the attribute list.
Briefly stated, the trust model problem is that some users may have
policies on which links they will and will not use that cannot be described
in anything less than a general purpose computational notation. Moreover, in
the most extreme cases, some users may not trust outside agents to pick a
route for their data, even given a complete statement of their requirements.
This means that computations in intermediate nodes cannot be used to route
traffic. This flies in the face of the current architecture, which assumes
traffic is routed on a hop-by-hop basis; i.e. each switching node makes an
independent decision on what to do with the traffic.
The attribute list problem is a more general consequence of the
specific observation above, in the discussion of the requirements, that it is
probably not possible to completely specify what policy attributes will be
needed at the moment. If the new system is to last, it must be able to change
and expand its capabilities. To do this, it is necessary to be able to invent
new types of attributes for links, and phase them in incrementally (since
such changes cannot be coordinated instantaneously across a large system).
This presents a problem, since this seems to violate the restriction given
above in the discussion of LS algorithms, namely, that all the nodes run the
same route generating algorithm. If one node is taking certain attributes on
links (and requests to use them in packets) into consideration when
generating routes, and another is not, this can potentially form routing
loops.
These two problems appear substantial, yet both can be solved by the
same mechanism. In addition, that mechanism allows yet more problems to be
solved, as detailed below.
4.2 Architectural outline
The general outline of the architecture is as follows (each element
will be discussed in more detail below).
The routing architecture will be a multi-level area (i.e.
hierarchical) LS system; in the topology graph representation of the network,
links and nodes have attributes. The exact algorithm for generating routes,
given the topology map, is not specified as part of the protocol (but a
default algorithm can be provided as an appendix). Abstraction of data will be
determined primarily by external mechanisms (although a simple default will be
provided either as part of the protocol or as an appendix), and in any case
clients are not bound to accept any abstraction unconditionally. Routes will
be completely determined by the initiator of the traffic; i.e. all packets
will be source routed. Packets will belong to ephemeral associations called
'flows'. (There may be other forces acting on the architecture which make
flows desirable; if so, several things can be done with this new mechanism.)
A new kind of address will be created, for use in making routing
decisions. The existing IP address field will be retained as well (exactly as
is in the short term), but the contents will be put to slightly different use
as a node identifier.
In the short term, hosts will continue to operate with no changes at
all, and use the existing packet formats for their interactions with routers,
although there will be certain optional optimizations that speed things up,
but are not necessary. In the long term, a transition to a longer version of
the node identifier (the old IP address) will be necessary. A number of
factors will probably eventually combine to cause the definition of a new IP
packet format, but most of those forces are outside the scope of this
document; however, phasing this longer version in will probably occur as part
of that transition.
A key thing to notice about this architecture is what is not part of
the architecture and protocol. Neither the algorithm to create routes from
the topology database, nor the exact method by which abstraction occurs are
part of the architecture. These are also the two hardest problems, and the
two in which future work is most likely to bring improvements. It is a key
feature of this proposed architecture that these two areas can be left
outside the scope of the protocol; future improvements can be easily phased
in incrementally.
Route generation in this architecture will need routing algorithms
which have a different goal from most existing work, such as the Dijkstra
algorithm. Most known routing algorithms create an entire routing table of
optimal routes to all destinations; since this has been what was needed in the
past, this is the area in which most research has occurred. The problem of
finding the optimal (or a good approximation thereof, within limits) route
between two points in a graph, ignoring other destinations, has not been
a topic of much work. This only increases the need to leave flexibility in
this area; improvements in algorithms and heuristics tend to be slow to
appear.
The things which are part of the routing architecture and protocol are
a way of representing and characterizing the network topology, a way to
distribute this information, and a way to set up traffic flows. Everything
else can be left to be specified later; this is a key to the power and
flexibility, as well as the ability to last (gracefully), of this architecture.
Upon some reflection this seems likely to be a theoretical minimum
for performing those tasks. It is not clear whether the fact that the two
hardest problems are left outside is fortuitous or a reflection of some
deeper correctness. In any case, the minimal design does limit the amount of
work to be done, as well as limit the possibility that something will be done
wrong or in an inflexible way.
"Perfection has been attained, not when there is nothing left to add,
but when there is nothing left to take away." [Corbato]
4.2.1 Multi-level LS Algorithm
The LS class of algorithm is chosen, basically for the reasons listed
above in the discussion of these algorithms. It is really the only practical
way to handle a system with complex access restrictions on links, which form
a key part of policy considerations. The expandable attribute list allows
future growth.
The algorithm has to be multi-level to handle the very large amount of
link state in the system. Even if the system were restricted to distributing
information about AS's, the general consensus is that a single level will not
be able to handle the number of AS's we expect to be deployed in the
reasonable future.
However, it is also felt that the time and need for AS's, and the
strict division of routing in inter- and intra-AS, is past. AS's are a limited
and simplistic mechanism that was created long ago to fulfill certain limited
goals, primarily that of providing routing firewalls; since the new
architecture can do all these, and better than the AS concept, it is now
entirely appropriate to discard AS's. (This does not mean that the concept of
policy groupings would be discarded; far from it. What it does mean is that
there will not be a particular layer which is called out for special
treatment.)
The primary reason for the introduction of AS's was to provide some
routing firewalls in the system. Since the new architecture will provide more
flexible and robust firewalls, retaining the archaic AS mechanism is
pointless. An additional advantage of AS's (and the division of the task of
routing between IGP's and EGP's) was to allow different routing technologies
to be used. The extreme power and flexibility of the new architecture makes
this less useful, and in any case, the fact that the process by which the
representation of an area is constructed (the abstraction process) is not in
the scope of the protocol means it would still be possible to use other
routing technology within an area, and pass the results of that process up as
the representation of the topology of the area.
The new system would be responsible for all routing in the IP
architecture; it would be able to depict, at its lowest level, the physical
transmission assets of the system. This unification of routing would not only
reduce the complexity (and likelihood of problems, implementation and
configuration cost, etc), but bring the full power and flexibility of the
new architecture to bear at all levels. A number of problems exist today
because of the split between inter and intra-AS routing; removal of this
artificial barrier allows greater flexibility.
The use of a link-state architecture does bring with it the attendant
difficulties of how to abstract link-state data. The use of thinning,
necessary for useful abstraction, brings with it the difficult task of choosing
which data to discard, and the problems caused by loss of the discarded
information. As noted above, this cannot be solved by purely technical means;
value judgements must be made about which information to discard. It will
usually be necessary for the administrators of any given area to help decide
what the representation of that area will be when the internal details are
hidden.
Since this is now a problem of entering configuration information, it
is desirable to do this in a way that allows no chance for inconsistent data.
A default algorithm will also be provided, for those who do not care what the
representation of their area is, or are unsophisticated and cannot participate
in choosing the representation of their area. The fact that there is no
specified algorithm for doing this also allows improved default algorithms to
be incrementally deployed as they are conceived.
However, the problem of non-optimal or unfound routes remains. The
solution to this part of the puzzle requires the next key component.
4.2.2 Source Routing
The source routing is a necessary consequence of both the trust model
part of policy considerations, and the expandable attribute list, as outlined
above. Having the source set the route avoids both of these difficulties. As
mentioned, it also allows the incremental deployment of better algorithms for
creating routes as ongoing research provides them, and also provides a means
to attack the non-optimal route problem, as noted immediately above.
Note that the route (and indeed the new style address) is not
necessarily carried in each and every packet; this is an engineering decision
to be made on the basis of the costs and benefits of doing so. The costs are
primarily the increased overhead of carrying around the extra data. There is
another issue in that the existing IP packet format will probably not allow
'on the fly' modification to hold the source route, since the latter will
probably be too large to fit into the IP header. Depending on the length of
the addresses, these could be a problem as well. This could be solved by
'wrapping' the packets, but at considerable cost in complexity and switching
speed (and some in line utilization).
The benefits of carrying the source route and/or address in each packet
are that the routers may remain more stateless. It is anticipated that routes
will not be carried in packets, and there will be a 'route set-up', as part of
the 'flow set-up', which happens invisibly to the client of the packet layer.
This flow will not be critical state, like a connection, but rather a hint. (A
'hint' is data which allows processing to be optimized, but which does not
have to be present to allow correct operation.) This may be recreated by the
routers without reference to the source of the flow setup should any
intervening node fail.
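A sketch of flow state treated as a hint (names again invented purely
for exposition) might look like:

   class FlowTable:
       # Per-router flow state as a hint: forwarding normally uses the
       # table, but a missing entry is rebuilt, not treated as failure.
       def __init__(self, rebuild_entry):
           self.entries = {}                   # flow id -> next hop
           self.rebuild_entry = rebuild_entry  # recreates lost state

       def next_hop(self, flow_id, packet):
           entry = self.entries.get(flow_id)
           if entry is None:
               # The hint was lost (e.g. a restart along the path);
               # recreate it locally, without going back to the source
               # of the flow set-up.
               entry = self.rebuild_entry(flow_id, packet)
               self.entries[flow_id] = entry
           return entry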
Obviously, since the source (or an agent close to it) chooses the
entire route, it is trivial for the source to enforce whatever constraints it
wishes on the path that its traffic takes (even constraints which are not
stateable in anything less than a general purpose computational language).
Alternatively, if the agent computes routes using links with attributes it
understands, but which the intervening switching nodes do not, the traffic
will be routed correctly, their lack of comprehension notwithstanding.
Also, if in computing a route, the agent generating the route comes
across an attribute it does not understand (either because it was recently
introduced, and the agent does not yet know how to deal with it, or because
it is a private attribute which the agent will never understand), the
agent will still be able to compute a route. If all attributes carry some
limited information about whether they are restrictive (certain types of
traffic will not be allowed across) or informational (giving information
about the link) it would be possible for a route to use links which have
attributes the agent does not understand.
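A sketch of that rule (the two-way classification, and all names, being
assumptions made for exposition):

   RESTRICTIVE = 'restrictive'      # if not understood, avoid the link
   INFORMATIONAL = 'informational'  # if not understood, safe to ignore

   def link_usable(attributes, understood):
       # 'attributes' maps attribute name -> RESTRICTIVE/INFORMATIONAL;
       # 'understood' is the set of names this agent can evaluate
       # (understood restrictive attributes are evaluated separately).
       # An unknown informational attribute is harmless; an unknown
       # restrictive one forces the agent to avoid the link.
       return all(kind != RESTRICTIVE or name in understood
                  for name, kind in attributes.items())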
Since the entire route is calculated by one agent, it clearly does
not matter if different agents use different algorithms for generating
routes. Deployment of new algorithms, or a choice between several existing
ones (for example, one which is fast but which does not produce good routes
would be useful for one-shot traffic, whereas more complex and expensive
ones would be appropriate for long-lived flows sending a lot of data)
thus becomes trivial. [SaltzerSR]
4.2.3 Local Abstraction Control
As for the non-optimal route problem, it was noted above that the use
of hop-by-hop routing required both a consistent algorithm and a consistent
database. Discarding hop-by-hop routing allows relaxation of the latter
requirement as well. This allows a powerful (and previously unconsidered)
adaptation of the typical operation of an LS algorithm. Canonical LS algorithms
require everyone to have an equivalent map.
("Equivalent" is used instead of "identical" since in existing
multi-level protocols such as OSPF, agents in different areas will each have
(identical) top level maps, as well as maps of their own area, but of course
these latter maps are different. All routers within a single area, however,
will (and must) have identical maps.)
However, there is no requirement in this architecture that this be so.
Neighbouring routers at the same level may have wildly different maps. One
might have only a default representation of a neighbouring area; another,
which is sending a lot of traffic into that area, might have detail on the
internal connectivity of that area, to allow it to pick the optimal entry
router into that area for the traffic it is handling. Any node may call for
more detailed topology information about any part of the system it wants,
provided that that information is accessible to it, and the cost of providing
that information is borne somehow.
However, the implications of lifting this restriction are
considerable. As noted above, the thinning problem is essentially a
cost-benefit tradeoff; making this tradeoff at the area which is being
described imposes the area's idea of the correct cost-benefit tradeoff on
everyone. However, allowing each route generator to make its own decision on
how much detail to keep in its maps allows this cost-benefit tradeoff to be
made at the client, i.e. with far greater flexibility.
It will be necessary for the costs of keeping maps with more detail
to be shifted to those agents which wish to keep them; this will require
attention at the time that accounting and resource usage algorithms are
brought to the network.
The default abstraction algorithm is thus seen as that which will
provide the theoretical minimum of data necessary to route traffic through the
system. The actual routes taken by traffic under this algorithm will thus be
strictly hierarchical. Looked at another way, local abstraction control thus
allows a way to do non-hierarchical routing.
5 Details of the Architecture
Many of the components necessary to this routing architecture and
protocol can be 'borrowed' from the work being done on the Open Routing
system, and similar LS systems such as the ARPANet algorithm, OSPF [Moy], etc.
These include the mechanisms to do reliable database distribution and flow
setup. Accordingly, description of these solved problems will be skipped in
this document.
This section includes more detail on certain key parts of the
architecture, along with calling attention to those parts in which engineering
questions remain.
5.1 Multi-Level Topological Representation
The first step is to consider the maps (graphs, actually) which
represent the actual detailed topology. It is worth noting that in the graph
representation of typical networks, there are usually two kinds of nodes;
'network' nodes and 'router' nodes. The arcs then represent network
interfaces. It is possible to have only a single kind of node (either
networks or routers), but this is artificial, loses valuable information, and
is more difficult to work with in practice than the more complex version.
This architecture will almost certainly use this type of model.
In the maps, it will be possible to take a contiguous section of the
graph (i.e. some number of nodes and their connecting arcs), and reduce it to
a simpler representation. Such a reduced section is called an 'area'. Areas
may be nested, and areas have as an attribute a level; an area which contains
another area has a level one higher than the contained area.
It is unclear whether an area will need to be represented as a section
of graph, or just a node. In either case, a new type of node for an area is
probably needed. The issue probably hangs on whether the rules for
constructing topologies demand router nodes between area nodes or not. In the
former case, the minimum representation of an area would be the collection of
border routers plus a pseudo-node for all the rest of the topology of the
area, to which the border routers are connected. If the area has complex
transit policies which vary depending on which pairwise set of border routers
are picked, a more complex representation allowing this to be expressed would
of course be necessary.
It is not clear whether it is best to allow only 'strict' reduction,
in which an area of level k would be allowed to contain only objects of level
k-1, or whether 'loose' reduction, in which objects of varying levels less
than k could be contained, is preferable. The latter is clearly more flexible,
but might make the engineering more complex. This decision can probably be put
off until the detail design happens. What is important is to realize that the
division into areas provides the framework for the default abstraction
algorithm; agents outside an area will not in general have detailed
information on the internal topology of the area.
On the subject of border routers, it seems wisest to set the rule
that all routers belong to a single area. This is a good match to reality,
where most of the physical assets have an owner; the half-router model has
proven to be less useful in practice.
One aspect of representation that needs to be addressed is the
question of configuration errors; this is the most likely place for such
errors. One technique that appears useful is to simply indicate to each router
whether or not it is a boundary router for a k level area, not which k level
area it belongs to [BBN]. Thus, misconfiguration of a router simply joins two
k-level areas together, but has no other ill effects on the routing. Similar
approaches must be taken throughout this aspect to minimize configuration
and prevent configuration errors from being a major problem.
In any case, it is clear that some decisions about open questions
need to be taken about the multi-level topology representation before any
other parts of the design can be worked on in detail. Obviously, these
decisions may need to be revisited if work elsewhere reveals that a poor
choice has made the design of some other part more complex than need be.
5.2 Multi-Level Addresses
The address format chosen will need to map closely onto the
representation of network topology, which in turn is chosen to make the job of
the routing simple. Remember, the address is simply a way of naming an
attachment point to the network, in other words, specifying a place in the
graph, and it is essential that, given an address, it be easy to correlate
that address with its location in the topology graph. (It might be
argued that the 'address' need not have that property, but it might map to
something else which does. All that is saying is that terminology has been
confused, and the wrong name for a network attachment point has been called
the address. The final concept will inevitably have to be one which is easy to
locate.)
In essence, what is proposed is a multi-level hierarchy, with a
variable number of levels, each of variable length. This may seem like
overkill, but remember that these addresses are only needed to create flows,
and since the cost is minimal it is better to have too much flexibility than
too little. The mapping from addresses to the map is fairly simple; the first
element of the address identifies which top level area an address is in, the
next element which of the next level areas within that top level area, and so
on.
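To make the mapping concrete (a minimal sketch in Python; the dict-of-dicts
map layout is an assumption made purely for illustration), resolving an
address against the map is just a walk down the nested areas:

    # Sketch: an address is a sequence of elements; each element picks
    # one object out of the enclosing area's table of children.

    def locate(address, top_level_areas):
        """Walk the address elements down the nested areas and return
        the innermost object the address names."""
        scope = top_level_areas
        obj = None
        for element in address:
            obj = scope[element]              # unknown element: bad address
            scope = obj.get('children', {})   # descend one level
        return obj

    # e.g. locate(('backbone-7', 'campus-3', 'net-12'), the_map)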
In any event, the bottom level will be the level representing real
physical assets, and numbering will proceed up from there. The advantage of
this is that two different systems with differing address spaces, each
allocated without reference to the other, can be joined by the simple
expedient of creating a new level above the top of each existing system and
making each system appear as a simple object in that level. [Reed]
Alternatively, a small independently numbered system could be joined onto a
large existing system as a branch at the appropriate level.
This will also avoid useless arguments about whether or not something
deserves to be a 'top-level' area. Since there is no such thing as a fixed
top-level area, pointless fights over status should mostly be avoidable. Note
that good functioning of the resulting system demands proper engineering of
the topology, and of the areas; since the area definitions provide the
framework for the abstraction process, which in turn affects the default
hierarchical routing, a poor choice of area boundaries will cause the default
abstraction process, and the default hierarchical routing, to work less well,
although it will continue to function.
The addressing analogue of the question of only allowing
'strict' areas is whether or not networks (and presumably hosts and routers
as well) can have numbers at any but the lowest level. If a k level area is
allowed to contain physical assets, presumably they can be addressed with
k-1 level addresses, without going all the way down to the bottom level.
The question of exactly what form the addresses of networks and
routers (if any) take is open. In the current architecture, no real provision
was made for numbering networks, although the convention of a zero 'rest'
field has come to mean the network. Routers are currently indistinguishable
from hosts (which is good and bad). In the new system, hosts will not have
addresses per se; the lowest level item which can be addressed is a network
interface, and presumably typical hosts will have one each.
Once again, choices have to be made, but with the understanding that
they may have to be revisited in light of work elsewhere on the architecture.
5.3 Topology Information Flooding and Route Generation
It is next necessary to consider the question of which devices perform
these functions. It seems most plausible that the first function is co-located
in the routers themselves, for reasons of fate-sharing; connectivity
information is flooded through the very connection elements that are being
described. This also removes dependency loops where topology flooding agents
which are not routers are required to make contact with neighbour topology
flooding agents through routers.
Route generation could, however, quite easily be performed in separate
devices, which communicate with routers (both local, and distant if they wish
to have detailed information about distant areas). It might be useful to build
this function into routers, but this is an implementation decision. Certainly,
clients with complex policy requirements will probably desire to generate
their own routes.
5.4 The Default Abstraction Algorithm and Local Abstraction Control
Given the discussion above, the default abstraction algorithm is very
simple. As stated above, the division into areas provides the framework for
the default abstraction algorithm. The routers which are border routers for an
area are responsible for doing the abstraction on the area topology when it is
advertised outside the area.
Two possibilities exist for a simple algorithm to provide a default
representation of an area. One, the N^2 model, maps the area as a complete set
of connections between all the border routers. This model can fairly
accurately represent the true characteristics of the connection between any
pair of border routers. (Of course, how to do so in the presence of policies
and types of service, where a multitude of connections are possible, will
require a bit of work; probably the least policy-limited path should be
advertised.) In the other, the pseudo-node model, all the border routers are
connected to a single node in the center. This cannot accurately represent the
metrics between any two border routers, but an averaging system will give a
good guess.
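The two models can be stated compactly (a sketch in Python; 'path_metric' is
assumed to give the best internal metric between two border routers, and the
halving rule shown for the pseudo-node model is just one plausible averaging
scheme, not a settled design choice):

    from itertools import combinations

    def n_squared_model(border_routers, path_metric):
        """Full mesh: one arc per pair of border routers, carrying the
        metric of the actual best internal path between them."""
        return [(a, b, path_metric(a, b))
                for a, b in combinations(border_routers, 2)]

    def pseudo_node_model(border_routers, path_metric):
        """Single central pseudo-node: each border router gets an arc
        of half its average path metric to the others, so the two-arc
        path a -> pseudo -> b approximates the real a -> b metric."""
        arcs = []
        for a in border_routers:
            others = [b for b in border_routers if b != a]
            if not others:                  # degenerate one-router area
                arcs.append((a, 'pseudo', 0))
                continue
            avg = sum(path_metric(a, b) for b in others) / len(others)
            arcs.append((a, 'pseudo', avg / 2))
        return arcs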
The option is always available to go to a representation with human
input, or to a representation prepared by some other algorithm. Some control
needs to be created to prevent areas from flooding the system with
unnecessarily complex representations of themselves.
Of course, if an area is represented solely by a single pseudo-node,
none of these problems arise, which is the attraction of this possible way of
representing an area. The disadvantage is that then areas must have transit
attributes which mirror the attributes (both policy and type of service) on
the actual transmission links which connect the border routers.
The working of the local abstraction control (somewhat inelegantly
named the "Blister Model") is simple to understand in this context. Generally,
any routing agent outside an area is given, as a default, whatever
representation that area has decided to purvey. However, should a routing
agent wish more detail on a non-local part of the map, nothing (other than
information-hiding access control, of course) stops it from getting the
information and upgrading its map.
5.5 Virtual Links and Flow Repair
One concept which will probably appear as part of the representation
of an area is that of a virtual link. Where it is desired to describe the
connection between two border routers, without providing detail on the
constituent links which actually connect the two, a virtual link (with
the attributes of the complete path between the two) could be advertised.
One key aspect of this idea is the ability to do local repair on flows
when topology changes happen. If a virtual link at level k, which was
advertised to a source route generator and which is used in creating a flow,
suffers a component failure, two outcomes are possible. First, the virtual
link may be reconstituted with the same attributes using other lower level
assets. In this case, the repair would be local, with no necessity for
intervention on the part of the source route creator. As an option it might be
notified if some attribute, such as delay, increased more than an allowable
amount on some component it used in setting up a flow. Second, if such a
reconstitution cannot be made, the k+1 level virtual link of which the k level
link is a constituent is notified, and the algorithm is then run recursively
at a higher level. Only if some asset fails which the source route creator
itself explicitly selected would it be necessary for the creator to select a
new path for the flow.
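The recursion is easy to state (a sketch in Python; 'reconstitute' and
'parent' stand for mechanisms the detailed design would have to supply):

    def repair(virtual_link):
        """Attempt local repair of a failed virtual link; escalate to
        the enclosing higher level link on failure."""
        if virtual_link.reconstitute():     # rebuilt from other lower
            return True                     # level assets: done locally
        if virtual_link.parent is not None:
            return repair(virtual_link.parent)  # same algorithm at k+1
        return False    # no repair possible anywhere up the chain; the
                        # source route creator must pick a new path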
5.6 Partitioned Areas
Handling of partitioned areas needs to be addressed. A number of
methods are possible. One obvious method is to reconnect the two parts of the
partitioned area by going through another area.
Another elegant method gives each level 1 area a globally unique
identifier. A k level area is given the identifier (at k level) of the lowest
numbered (or highest numbered, or by some other deterministic rule) k-1 level
area within it.
Then, if a k level area partitions, it automatically creates a new k level
area, which is automatically numbered from its constituent k-1 level areas.
The disadvantage of this is that the address of everything within the new k
level area has changed, which is probably a substantial disadvantage (although
not insuperable).
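In outline (a Python sketch, with the field names assumed purely for
illustration), the numbering rule is simply:

    def area_identifier(area):
        """A k level area takes the identifier of its lowest numbered
        k-1 level constituent; level 1 identifiers are assigned
        globally.  A partition thus renumbers itself automatically."""
        if area.level == 1:
            return area.unique_id
        return min(area_identifier(c) for c in area.constituents)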
5.7 The Node ID
The node identifier (what used to be the IP address) is a key part of
the system; not only does it provide new capabilities, but it makes the new
system incrementally deployable. Instead of being a number with structure, as
it is currently (which is causing inefficient use of the number space), it is
simply a unique 32-bit (initially) number. That way, we put off the day when
that address space runs out, since we can use it efficiently instead of
wasting large gaps, as we do now.
A number of outstanding issues, particularly mobile hosts and
multi-homed hosts, can be solved with this, with no change to the upper
levels. An open TCP connection to a given node identifier will stay open even
if the node moves and gets a new address. (The flow may need to be rehomed,
but this will be invisible to the TCP, and perhaps to the host entirely.) If
an explicit mapping exists from a node identifier to multiple addresses, many
issues having to do with multi-homed hosts (both for reliability as well as
bandwidth) become much simpler.
The source and destination of packets in the network will continue to
be identified by these identifiers. If we choose not to include the new
addresses in each packet (as seems likely), a packet arriving at a router is
assigned to a particular flow by inspecting these numbers. This process will
be at least as quick as current routing, and if the addresses are not included
in the packets, there will be no extra line overhead either.
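The packet-to-flow assignment might look like this (a sketch in Python; the
field names and the table layout are assumptions for illustration):

    flow_table = {}    # (source node id, destination node id) -> flow

    def classify(packet):
        """Assign an arriving packet to its flow by inspecting the
        node identifier pair carried in the packet; the lookup cost
        is comparable to a current routing table lookup."""
        return flow_table.get((packet['src_id'], packet['dst_id']))

    # Entries are installed at flow setup time, e.g.:
    #   flow_table[(src_id, dst_id)] = flow_state_from_setup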
5.8 Mapping to the Address, from the Name and from the Node ID
Obviously, a system for providing the new addresses, and mapping
between them and other identifiers, such as host names and node identifiers,
needs to be available. It clearly ought to be distributed and replicated. We
already have a distributed, replicated system for containing mappings: the
DNS. It
would probably be simple to add this mapping as well.
When going from the string name, the obvious thing is to return the
new address as well. For clients which need to map from node identifiers to
addresses, some analogue of IN-ADDR will have to be provided.
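For instance (purely hypothetical; neither the domain suffix nor any record
type is proposed here), a 32-bit node identifier could be reversed into a
lookup name in the same least-significant-octet-first style as IN-ADDR:

    def reverse_lookup_name(node_id):
        """Build an IN-ADDR style name for a 32 bit node identifier;
        'node-id.arpa' is a hypothetical suffix used only here."""
        octets = [(node_id >> shift) & 0xFF for shift in (0, 8, 16, 24)]
        return '.'.join(str(o) for o in octets) + '.node-id.arpa'

    # e.g. reverse_lookup_name(0xC0A80001) -> '1.0.168.192.node-id.arpa'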
One thing to be careful of is that in the process no dependency loops
are formed. For instance, error reporting might need to get an address if all
that is on hand is a packet with the node identifier. The actual process of
getting routing information around would have to be examined carefully to make
sure it does not rely on the existence of this mapping system, otherwise
dependency loops will certainly form.
5.9 A Sample Route Generation Algorithm
One simple way of generating a route in a complex policy environment
is to run over the map, dropping all links which are unusable for policy
reasons or because they do not match the type of service desired. It would
then be possible to run the Dijkstra algorithm to create a routing tree, and
thus a route, for a given metric which it is desired to minimize (probably
delay or cost). To optimize several metrics at once, a formula for weighing
the metrics together to create a composite metric needs to be provided; as
each link is examined in the Dijkstra, the composite metric will be calculated
from the exact metrics of the link, and the composite metric will be what the
Dijkstra minimizes.
Metrics such as bandwidth, or MTU, where minimization over a path is
inappropriate, would be handled by dropping inappropriate links in the first
phase; if no route can be found, the actual requirement on this metric could
be lowered and the process repeated to see if a route appears when a less
optimal value is allowed.
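Putting the two phases together (a sketch in Python; the graph encoding and
the 'usable' and 'composite' functions are assumptions supplied by the
caller, not part of the architecture):

    import heapq

    def generate_route(graph, src, dst, usable, composite):
        """graph: {node: [(neighbour, link_attrs), ...]}.
        usable(attrs) -> bool drops links on policy, type of service,
        or threshold grounds (bandwidth, MTU); composite(attrs) ->
        float is the weighted single metric Dijkstra minimizes."""
        dist, prev = {src: 0.0}, {}
        heap = [(0.0, src)]
        while heap:
            d, node = heapq.heappop(heap)
            if node == dst:
                break
            if d > dist.get(node, float('inf')):
                continue                    # stale queue entry
            for neighbour, attrs in graph.get(node, ()):
                if not usable(attrs):       # phase one: drop the link
                    continue
                nd = d + composite(attrs)   # phase two: composite metric
                if nd < dist.get(neighbour, float('inf')):
                    dist[neighbour], prev[neighbour] = nd, node
                    heapq.heappush(heap, (nd, neighbour))
        if dst not in dist:
            return None     # caller may relax a threshold and retry
        path = [dst]
        while path[-1] != src:
            path.append(prev[path[-1]])
        return list(reversed(path))

If None comes back, the caller lowers the offending requirement inside
'usable' and runs the algorithm again, as described above.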
5.10 Authentication and Robustness
Making the system robust against attack is vitally necessary. One
approach would be to sign all data with a private key system; this guarantees
that topology information could not be modified as it is flooded through
the system. The same approach could be used in flow setup; acknowledgements
tagged with the private key would prevent the flow from being hijacked to
a different destination by a corrupted switch (although it could be copied).
Protection against denial of service attacks is more difficult. In
the above example, the corrupted switch could prevent the flow from being
set up, although it cannot make it appear that the flow has been correctly
set up when it has not. Of course, if the switch refuses to set up the flow,
an alternate path could be picked which avoids that switch.
Another difficult problem is bogus connectivity information. Some
checks can be made, such as ensuring that both ends of a link agree that it
exists, but this will still not catch collusion. Once again, it can be
detected that this is happening if the flow does not set up correctly, and the
errant switches avoided. However, if the switches claim a direct connection,
which causes traffic to flow to them, and then route the traffic through,
it will be difficult to detect except by measuring the service provided.
6 Possible Optimizations
Although this document is not an engineering design, there are a
number of possible optimizations which are worth discussing here. Many other
such optimizations are possible. They have not been considered in detail here,
since they are generally low level engineering and depend on the exact details
of the structure chosen. The key focus in this document is to discern the hard
problems, and pick a broad line of attack on them.
6.1 Default Routing Tables
One useful optimization which might be included is to provide a
facility for handling traffic without the overhead of calculating a route
and setting up a flow. [Estrin] Needless to say, this could only be done
for a limited number of classes of traffic, but it is possible. Basically,
a way would exist to indicate that traffic was not part of any flow, and
one (or a small number) of default routing tables for such traffic would be
precomputed, in all routers, as is the current practice.
This would not be unreasonably expensive. For example, assume that
the system contains 5 levels, and the fan-out at each level is on the order
of 200. Thus, the system could handle up to 200^5 networks, or 3.2x10^11
networks; this should be large enough for the reasonable future. Such a
system would only mean that the average router with a minimal map would
need a map with 5*200*(3+1), or 4000, nodes (assuming that routers outnumber
networks/areas by a 3:1 ratio, which seems about right from the current
system; this gives triple connectivity to each network, or fairly redundant
connectivity). Assuming each router is connected to 3 networks, then the
map would contain 5*200*(3*3), or 9000, arcs. Such a map would be fairly
easily handled by even the current generation of router hardware.
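The arithmetic can be checked directly (Python):

    # 5 levels, fan-out 200; 3 routers per network/area, each router
    # on 3 networks (the assumptions used in the text above).
    levels, fanout = 5, 200
    print(fanout ** levels)              # 3.2e11 addressable networks
    print(levels * fanout * (3 + 1))     # 4000 nodes in a minimal map
    print(levels * fanout * (3 * 3))     # 9000 arcs in a minimal map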
The problem with this idea (within this architecture) is that a
mapping must still be created from the destination node identifier to the
address. How does this mapping happen? One possibility is to have the
first hop router (or the host) add it to the packet. Alternatively, each
router along the path would have to do the mapping. Clearly, the mapping in
each intermediate router could be installed via a setup, but this would just
get us exactly what we are trying to avoid!
Also, it is not clear if this optimization will be useful in the
future. It remains to be seen whether, in a system in which policy
configuration and access controls are much easier, and thus more common, much
traffic will be sent in the default class, and whether the free, general
access model of network use we have enjoyed will continue. If not, the concept
of default
routing tables is probably not a useful one, since almost all traffic would
need a flow set up to handle it.
6.2 Map Discarding in Stub Areas
One additional idea, if default routing tables are included, is to
support stub areas. The complaint might be raised that even supporting a
map with 4000 nodes (from the example above) is pointless if an area is
only singly connected to the outside world, or neither wants to support
transit traffic nor pick optimal area exit routers. In these cases having
the topology map for the rest of the system outside the area is
unnecessary; the traffic is simply sent to any border router, and is routed
from there.
(The former simplification is obvious; doing either of the latter
two means that routers inside the area must be able to differentiate as to
which border router is the best one to head toward; i.e. they must have
access to a full outside map to know which border router is the best exit
router, if routing loops are to be prevented.)
6.3 Flexible "No-brainer" Routes
One major potential issue with the current architecture is the fact
that the simplistic "no-brainer" route is that which is computed with the
minimal map, and that minimal map implies strict hierarchical routing.
The problem with that is that a K level area, such as a campus, which
in actual physical terms has connections to several long-haul networks, each
of which is probably a separate K+1 level area, has an insufferable choice to
make. Depending on which K+1 level area it associates itself with, any
traffic using the simple "no-brainer" routing will get to that K level area by
means of the long-haul network with which the K level area is associated in
its global address. This is the problem referred to earlier, where it
was indicated that perhaps using the address to generate the "no-brainer"
route is an overload of that name which finally breaks down.
One solution to this problem involves the use of what are called
"route suffixes" [Clark]. When the host-name to address translation is
performed, the address which comes back comes with several incomplete source
routes, which represent potential useful paths, terminating at the
destination, through the nearby topology. If the most detailed (last) item in
the route suffix is not known to the source, it backs up and tries the previous
(less detailed) item, and so on. Given several route suffixes, probably one
for each major long-haul network which an area is attached to, the source can
easily make a determination as to which long-haul network it prefers to
use to get to the destination.
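The fallback rule can be sketched as follows (Python; representing a suffix
as a least-to-most detailed list is an assumption made for illustration):

    def best_known_entry(route_suffix, known_in_map):
        """route_suffix: elements ordered least to most detailed.
        Return the most detailed element present in the source's map,
        backing up toward less detailed ones; None if even the least
        detailed element is unknown."""
        for element in reversed(route_suffix):
            if known_in_map(element):
                return element
        return None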
The major advantage of this scheme is that it is fairly efficient; a
small amount of extra data is passed back at the time of the address lookup
which allows a choice to be easily made. What mechanisms are available in this
architecture to do this?
One rough equivalent in this architecture can easily be found by
allowing each network attachment point to have multiple addresses; this would
indicate that a destination could be reached via multiple long-haul networks.
(This is equivalent to allowing K-level areas to overlap.) This solution is
not preferred since it conflicts with a previous goal of the address, which is
to provide a way of concisely describing the topology. If any network
attachment point can have many names (addresses), this will be more difficult.
In effect, this approach overloads yet one more function onto the
address, which is to assist in picking a slightly more optimal route; the
address is not really suited to having to support this one further function,
however. There are a number of ways to do this which do fit well within the
architecture, though.
The first is a slightly more general variant of the original scheme,
which is to use the local abstraction control mechanism to discover some
details of the topology around the destination. This seems to be less
efficient than the "route suffix" mechanism, but has the same effect; the
source is given more information which allows it to make a choice.
The second, in some senses a variant of the first, is to remove from
the map all areas which do not allow transit traffic; i.e. "stub" areas. This
would greatly diminish the number of nodes in the map, and allow more detail
to be kept. Whenever a source needed to send traffic to a destination in that
stub area, it could pick up topology information about that area and enter
it into the map.
Both of these represent a slightly more formal way of doing
essentially what the route suffix does, which is provide more routing
information specific to the destination. The difference is that the
information in the route suffix is picked by the destination, as opposed to
the source being given general information about the topology near the
destination.
One final option is to provide essentially exactly the route suffix
mechanism. As alluded to above in the discussion of fundamental object
classes, it might be necessary, if this optimization is deemed important, to
provide a new object class, that of route suffix. Rather than overload this
function onto any of the existing object classes (and the names for those
objects), it is perhaps best to simply bite the bullet and create the new
object class.
7 Deployment
A great advantage of this architecture is that it can be deployed in
an incremental fashion, and furthermore that it does not require any changes
in hosts for fairly full operation (although minor changes would improve
things somewhat).
The following section is a first crack at defining how the deployment
would work.
7.1 Incremental Deployment in the Routers
The system can fairly obviously be interoperated with the existing
architecture provided host identifiers continue to be allocated along the
existing lines of network numbers. This will allow the system to be deployed
and brought up incrementally in the routers.
An old-style router will still be given lists of network
numbers, and the metric to each. In new-style routers, areas of the network
which are being serviced by old-style routers will be masked by 'transitional'
routers which provide a mapping from the old routing protocol to the new
system.
Of course, as the number of networks mounts, the existing routing
will break down, and render the old-style routers non-functional. By that
time, basically all the main routers should have been converted to use the
new system, and the old routing mechanisms can be turned off. Stub routers
which use default only can continue to operate as before, obviously. Once the
old-style routing mechanisms are turned off, allocation of host identifiers
can proceed in a more flexible and space efficient form.
Obviously, old-style routing could continue to be used once this
happens, but hosts served by old-style routers would probably not be
able to get access to the new hosts, since the new host identifiers would
not contain any topology information the old-style routers could use to
route the traffic. One possible way around this is to have all traffic
to what looks like new network numbers sent to a new-style router which would
perform the mapping in a manner similar to the way hosts are supported.
7.2 Incremental Deployment in the Hosts
For the first stage, no changes would be absolutely necessary in the
hosts at all. The host would translate from the name to the host identifier,
and send out an existing packet to the first hop router.
The first hop router would then do the flow setup, using some default
policy attributes. The destination network address would either be looked up
separately, with attendant overhead, or, if the router overheard the name
lookup, it might have cached it.
The issue of network masks and node identifiers needs to be thought
about to make sure that getting from hosts to neighbouring hosts, as well as
first hop routers, continues to work.
Clearly, if node identifiers are assigned without respect to topology,
then the mask and match process can no longer be used to determine whether or
not a given destination is on the local wire. Doing anything else means that
the address space will again be allocated inefficiently, bringing us back to
where we were! However, given the practical difficulties involved in true
random allocation of node identifiers, it may be necessary to allocate in
blocks.
One issue is involved in the translation of node identifiers to local
hardware addresses. Some possibilities exist involving use of ARP, but these
are nasty. For networks which depend on direct mapping from the node
identifier to the network address, it is obviously unavoidable to allocate
chunks of the node identifier space.
Another problem involves making sure that each host understands how to
get to its first hop router for off-network traffic. If hosts exactly follow
the Host Requirement RFC, and are set up with an all ones subnet mask, then
any attempt to send traffic to another host will perforce go to a router. The
default router table will have been filled in via the Router Discovery ICMP
mechanism, and the router's local network address will be resolved as
discussed above. This is inefficient for traffic on the local network,
however; if the all ones mask approach is used, routers will also have to stop
sending Redirects!
If a proposed document on routing in the host is worked on (or
alternatively a revision to the Host Requirement RFC is put in hand), this
topic should be carefully considered to make sure the proposed mechanism will
integrate well into this new approach.
Eventually, the process could be optimized by slight changes in the
host. It would ask for and remember the new-style address when it contacted the
name server; if it had special policy requirements (or charge authorizations,
or whatever) it would include these in the packet it sends out, either in the
initial connection packet or in a special communication to the first hop
router, or whichever box is doing the flow setup.
7.3 Upgrading to a Long Node ID
A 32-bit system wide node identifier is clearly unsuitable in the
long run. One possibility is to make the UID's significant only locally
(perhaps pairwise locally); however, this is complex, and
loses some of the advantages of having a node identifier.
The other main possibility is a new IP packet format with 64 bit
UID's. A new packet format might be desirable anyway, for other reasons. If
the 64 bit UID path is chosen, and the phaseover started before the 32 bit
space is used up (so that the UID of any node in the new 64 bit space is simply
the zero extension of its UID in the 32 bit space), there will not be any
cases of new and old style hosts which cannot communicate due to the inability
of the old address space to name hosts in the new space, which is the usual
cause of problems in conversions. Once basically all hosts have been
converted to the new format, use of the rest of the identifier space can
proceed. This is probably the preferable path.
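The compatibility property rests on nothing more than zero extension (a
minimal Python sketch):

    def extend_uid(uid32):
        """An old 32 bit UID names the same node in the 64 bit space."""
        assert 0 <= uid32 < 2 ** 32
        return uid32                    # high 32 bits are zero

    def truncate_uid(uid64):
        """Valid only while assigned UIDs still fit in 32 bits; once
        the upper half of the space is in use, old-style hosts can no
        longer name such nodes."""
        if uid64 >= 2 ** 32:
            raise ValueError('not representable to an old-style host')
        return uid64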
A number of ways of interoperating old and new style hosts exist.
The first hop router might automatically convert old-style packets to new
style. The advantage of this is that hosts can throw out the old code; the
disadvantages are the cost of the conversion to and fro when two old-style
hosts are conversing. Another possibility is that hosts might continue to
understand both old and new.
In any case, there are four cases to be considered; old-old, old-new,
new-old, and new-new (the first being the source and the second the
destination of the packets). The tricky one is the new-old, since in any
system something must convert the format; either the source host needs to
find out that the destination does not understand new-style (perhaps by trying
a new-style first and seeing if it gets a response) and send an old-style,
or the last-hop router, which needs to know which of its hosts are new and
which are old, needs to do the conversion. A messy problem however it is
sliced! One useful trick is to use spare bits in the headers to record
packets which were sourced in the other format, or hosts which can generate
and understand new format packets, etc.
8 Conclusion
The system proposed above consists of many elements, with the utility
of each created and increased by the interaction with others. The way in which
they were put together was not as straightforward as it now appears. A long
process of revisiting of requirements and tinkering with the design has
resulted in the complete structure now laid out, with each piece dependent on
others and part of a deeply integrated whole.
The system also contains a number of old ideas, put together in a
novel way, together with the addition of some new ideas. In both of these
aspects (the extreme interdependence of a limited set of ideas, developed over
time, together with the reuse of existing good ideas with some augmentation)
it resembles the process which created Unix. [Thompson]
It is believed that the proposed architecture detailed above meets all
the goals outlined in the requirements, and in particular the size goal,
making possible an extremely long (essentially indefinite) use of the IP
architecture.
Finally, it is worth noting again the many instances here in which
the power and flexibility of the architecture are improved by leaving
something out (route generation, data abstraction, etc), rather than by
adding something.
References (Incomplete)
[BBN] ???; "Adding Areas to the ARPANet Routing Algorithm??";
1982??.
[Chiappa84] J. Noel Chiappa; "??"; 'dev-cgw' mailing list; 1984?.
[Chiappa91] J. Noel Chiappa; "The IP Addressing Issue"; Internet-Draft
Yyyyyy 1991.
[Clark]
[Cohen] Danny Cohen; "On Names, Addresses and Routings"; IEN 23;
January 1978.
[Corbato] Fernando Corbato and Jerome H. Saltzer; "Multics: The
First Seven Years?"; ??
[Dijkstra] Edsger W. Dijkstra; "A Note on Two Problems in Connexion with
Graphs"; Numer. Math., Vol. 1, pp. 269-271; 1959.
[Estrin] Deborah Estrin, Yakov Rekhter and Steve Hotz; "A Unified
Approach to Inter-Domain Routing"; In preparation.
[McQuillan] John M. McQuillan, Isaac Richer and Eric C. Rosen; "ARPANet
Routing Algorithm Improvments"; Bolt, Beranek and Newman Report No. 3803;
April 1978.
[Little] M. Little; "Goals and Functional Requirements for Inter-
Autonomous System Routing"; RFC1126; October 1989.
[Moy] John Moy; "The OSPF Protocol"; RFCXXXX; Yyy 1991.
[Reed] David Reed; "??"; CSR Group Talk; 1979??
[SaltzerSR] Jerome H. Saltzer; "Source Routing?"; ??
[Saltzer82] Jerome H. Saltzer; "On the Naming and Binding of Network
Destinations"; Local Computer Networks, North-Holland Publishing; 1982.
[Shoch] John F. Shoch; "Inter-Network Naming, Addressing and Routing";
IEN 19; January 1978.
[Thompson] Dennis M. Ritchie and Ken Thompson; "The UNIX Time-Sharing
System"; CACM, Vol. 17, No. 7, pp. 365-375; July 1974.